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Abstract 

Background: Classification and naming is a key step in the analysis, understanding and adequate management of 
living organisms. However, where to set limits between groups can be puzzling especially in clonal organisms. 
Within the Mycobocterium tuberculosis complex (MTC), the etiological agent of tuberculosis (TB), experts have first 
identified several groups according to their pattern at repetitive sequences, especially at the CRISPR locus 
(spoligotyping), and to their epidemiological relevance. Most groups such as "Beijing" found good support when 
tested with other loci. However, other groups such as T family and T1 subfamily (belonging to the "Euro-American" 
lineage) correspond to non-monophyletic groups and still need to be refined. Here, we propose to use a method 
called Affinity Propagation that has been successfully used in image categorization to identify relevant patterns at 
the CRISPR locus in MTC. 

Results: To adequately infer the relative divergence time between strains, we used a distance method inspired by 
the recent evolutionary model by Reyes et al. We first confirm that this method performs better than the Jaccard 
index commonly used to compare spoligotype patterns. Second, we document the support of each spoligotype 
family among the previous classification using affinity propagation on the international spoligotyping database 
SpolDB4. This allowed us to propose a consensus assignation for all SpolDB4 spoligotypes. Third, we propose new 
signatures to subclassify the T family. 

Conclusion: Altogether, this study shows how the new clustering algorithm Affinity Propagation can help building 
or refining clonal organims classifications. It also describes well-supported families and subfamilies among M. 
tuberculosis complex, especially inside the modern "Euro-American" lineage. 

Keywords: asexual organisms, species delineation, epidemiology, DR locus, Clustered Regularly Interspaced Short 
Palindromic Repeats (CRISPR) 



Background 

The advent of powerful genotyping methods, either by 
global sequencing or by high-throughput analysis of var- 
iation at specific loci (mini- or micro-satellites [1]; 
CRISPR (Clustered Regularly Interspaced Short Palin- 
dromic Repeats) loci [2,3]; SNPs [4]), provides masses of 
genetic data that need to be compared and clustered. 
Most widely used comparison methods are phylogenetic 
methods i.e. hierarchical clustering, building tree-like 
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structures to display the diversity. These methods con- 
sider that each individual forms a cluster and repeatedly 
merge the most similar clusters based on pairwise dis- 
tances (Phenetics such as Neighbour-Joining), or try to 
infer the tree that most fits the data (Cladistics such as 
Maximum Likelihood, Bayesian analysis) using an 
appropriate evolutionary model of the compared charac- 
ters. This provides a continuous pattern of how diver- 
gent organisms are. Other comparison methods consist 
in finding relevant clusters, a process referred to as par- 
titioning. A method made popular by the software 
Structure [5], and referred to as model-based clustering, 
consists in using Bayesian methods to assign individuals 
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in a pre-determined number of populations. The 
assumption underlying this software is that the popula- 
tion conforms Hardy- Weinberg hypotheses i.e. refers to 
organisms reproducing sexually, with random pairing 
inside the population. This assumption is theoretically 
very problematic for clonal organisms, although practice 
has shown that it can provide meaningful results [6], 
partly because some parameters can be set to mimic 
poor mixture inside populations. Other methods have 
been developed outside biology, for instance to categor- 
ize images [7,8]. They use similarity to group data in 
spherical clusters well represented by their centroid 
(also called representative or exemplars), and have 
already been tentatively used to classify human genetic 
data [9]. This method awaits further experimental vali- 
dation on large datasets. 

Clustering methods can be applied to different types 
of loci, ranging from repetitive sequences such as inser- 
tion sequences, micro-, mini-satellites or the CRISPR 
loci to single nucleotide polymorphisms (SNPs), pro- 
vided an appropriate method is available to calculate the 
distance between individuals. Such methods usually rely 
on a model of the mutation process. Which loci should 
be targeted depends on the mean divergence time 
between individuals, as repetitive sequences mutate fas- 
ter than SNP loci. Several mutation models have been 
developed for DNA sequence with point mutations [10]. 
For repetitive sequences (micro- and mini -satellites), 
categorical distance or the Stepwise Mutation Model 
(SMM [11]) are mostly used. 

CRISPR loci (Clustered Regularly Interspaced Short 
Palindromic Repeats) form a new family of repetitive 
sequences [12,13]. They consist in the repetition of 24 
to 47 bp sequences called Direct Repeats (DR) separated 
by unique sequences called spacers (from 26 to 72 bp). 
The constitutive unit is therefore the combination of 
one DR and one spacer, and presently described CRISPR 
loci have from 2 to 249 units [13]. These repetitions are 
surrounded by protein-encoding sequences called cas 
genes (derived from CRISPR-^5sociated genes). The 
whole locus confers resistance to bacteriophages and 
plasmids in Streptococcus thermophilus [14,15] and in 
Escherichia coli when overexpressed [16]. Resistance to 
the corresponding organisms is under investigation in 
species where spacers are homologous to foreign DNA 
[17]. They exhibit a quite high mutation rate so that 
they have proven useful for epidemiological studies. 
Describing the presence or absence of 43 spacers of M. 
tuberculosis CRISPR locus has become a routine techni- 
que in any tuberculosis reference center and is referred 
to as spoligotyping for spacer o/Zgonucleotide typing 
[18]. Pairwise comparisons of binary profiles can provide 
a distance matrix that has been used to infer phyloge- 
netic relationships. The most common approach to infer 



relationships using spoligotype patterns uses the Jaccard 
index (same principle as Hamming distance or Dice 
coefficient) as distance [19], counting the proportion of 
spacers that are present in both profiles. The distance 
matrix, either made of pure spoligotyping data or com- 
bining them with minisatellite data, is usually processed 
using UPGMA or NJ algorithm to build a dendrogram 
or a phylogenetic tree [20]. A more elaborate approach 
using the Zipf distribution and the evolutionary 
dynamics of CRISPR loci has proven more relevant to 
infer phylogenetic relationships for the M. tuberculosis 
complex [21] but is not implemented in a user-friendly 
software yet and does not propose assignations for all 
currently described spoligotype patterns. 

The worldwide database of spoligotyping in M. tuber- 
culosis complex is called SpolDB (the latest public ver- 
sion being SpolDB4), and has helped identifying 
recurrent signatures in CRISPR profiles [22-24]. These 
signatures, mainly based on the absence of adjacent 
spacers, led to the naming of large clonal families, the 
monophyly of which has been confirmed through other 
markers such as minisatellites (referred to as MIRU- 
VNTR for Mycobacterial-Interspersed-Repetitive-Units- 
Variable Number of Tandem Repeats), Region of Dele- 
tions (RDs) and SNPs [6,25,26]. Main acknowledged 
families are EAI for East- African-Indian (later referred 
to as "Indo-Oceanic" by Gagneux et al.), M. africanum 
1 and 2 (later "West-african 2" and "West-African i"), 
animal strains (M. bovis, M. caprae, M. pinnipedii, M. 
microtii), CAS for "Central-Asia" (later "East-African- 
Indian"), Beijing, X, Haarlem, LAM for "Latino- Ameri- 
can and Mediterranean", T and MANU (also designated 
as T ancestor) [23,27]. Monophyly of each of the LAM, 
T and Haarlem families has been partly invalidated. 
However, the larger lineage to which they belong and 
that is characterized by the 33-36 spacers deletion at the 
CRISPR profile (merging T, LAM, X, Haarlem families 
and S subfamily) has been confirmed and designated as 
the "Euro- American" lineage [27]. It corresponds to the 
Principal Genetic Groups (PGG) 2 and 3 as defined by 
Sreevatsan et al. [28]. Altogether deletions in spoligo- 
type patterns have proven to contain phylogenetic 
information and allow most strains be assigned to the 
families described above. Assignations performed by 
experts are reported in SpolDB4 database, patterns car- 
rying no or contradictory signatures been labeled as 
"U" for "Unknown or Unclassified". To circumvent the 
dependence on experts' analysis, the Bennett's labora- 
tory proposed automatized classification of spoligotype 
patterns using Bayesian algorithms and a distance 
method taking into account the deletion process by 
which spoligotype patterns evolve. They provide an 
on-line tool called Spotclust [29] to assign each spoli- 
gotype to a family, either one described in SpolDB3 



Borile et al. BMC Bioinformatics 201 1, 12:224 
http://www.biomedcentral.eom/1 471 -21 05/1 2/224 



Page 3 of 14 



[30] or one of the 6 large families proposed by Gag- 
neux and coworkers [31]. 

Here we wanted to take advantage of a recently devel- 
oped algorithm, Affinity Propagation, to confirm and 
extend these methods. This algorithm identifies refer- 
ences for every data point so that data are grouped and 
centered on these references while a specific cost func- 
tion is minimized. The cost of adding a new reference 
point, assigned by the user, determines the final number 
of clusters. Prior to the use of this algorithm, we tested 
different distances to calculate pairwise distances 
between spoligotype patterns. We took advantage of 
previously identified references and expert assignation 
to rank these distances, some of which are derived from 
previously proposed evolutionary models [21,31]. The 
distance that best identified the appropriate reference 
for each spoligotype pattern was implemented in the 
Affinity Propagation algorithm to identify relevant sub- 
families among M. tuberculosis complex (MTC). These 
families partly correlated with previously identified 
subfamilies. 

Altogether, this approach allowed us to assess the 
robustness of previously identified sublineages among 
MTC, to identify new relevant sublineages and to pro- 
vide re-assignations of the spoligotype patterns 
described in SpolDB4. These re-assignations interest- 
ingly matched those of studies using VNTR and/or SNP 
data. 



Table 1 References of the ten best acknowledged M. 
tuberculosis complex families 



SIT 


SpolDB4 classification 


Reference Spoligotype pattern 




family 


subfamily 




1 


BEIJ 


BEIJ 




26 


CAS 


CAS1 




42 


LAM 


LAM9 




50 


H 


H3 




53 


T 


T1 




100 


MANU 


MANU1 




119 


X 


X1 




236 


EAI 


EAI5 




181 


AFRI 


AFRI1 




482 


animal 


B0V1 





BEIJ = Beijing also referred to as East Asia; CAS = Central Asia also referred to 
as East Africa and India; LAM = Latino-American and Mediterranean; H = 
Haarlem; EAI = East African Indian. 



spacers in the CRISPR locus, "Blocks" measuring the 
proportion of shared blocks of spacers, and "Deletions" 
measuring the proportion of shared blocks of deleted 
spacers (see Methods and Figure 1). We implemented 
these four methods to compute the distances of each 
spoligotype of SpolDB4 database [33] to the reference 
profiles of the ten families (see Table 1). For each 
method, depending on the reference to which it was 
most similar, each spoligotype was assigned to one of 



Results 

Comparison of classifications based on new distances or 
on Jaccard index to expert classification of SpolDB4 

Clustering of CRISPR patterns (spoligotypes) of M. 
tuberculosis complex is commonly done using the Jac- 
card index as distance [32]. This index counts the 
shared spacers without taking into account their spatial 
organization. However, it has been shown that adjacent 
spacers have a higher probability to be simultaneously 
deleted [21], and this feature has been used by experts 
to define M. tuberculosis complex families and subfami- 
lies [22,23] in the international database SpolDB [33]. 
We wanted here to identify a distance conducing to the 
best concordance of spoligotype assignations at the 
family level, as available in SpolDB4 database [23]. We 
retained the ten widely acknowledged families: M. afri- 
canum, Animal strains (grouping M. bovis, M. pinnipe- 
dii, M rnicrotii, and M caprae), Beijing (herein also 
referred to as Beij), CAS, EAI, Haarlem (also referred to 
as H), LAM, MANU, T and X [25]. Each is character- 
ized by a different spoligotype signature and thus a dif- 
ferent reference profile [22,33] (Table 1). In addition to 
Jaccard index, we set up three methods to compute the 
distance between pairs of patterns: "Domain Walls" 
measuring the proportion of shared limits of blocks of 



A. Jaccard 

^^^^□□□□□□□□□□□□□□□□□□□□fffffDDDDfffffn 

^□□□□^□□□□□□□□□□□□□□□□□□□□|Mg|DDDDMjp|M 

2 +2*2 +4*2 +7*2 =28/32=87.5% 



B. Walls 

■■■■■■■^□□□□□□□□□□□□□□□□□iiiidnnniii 

^□□□□^■^□□□□□□□□□□□□□□□□□□^■■■^□□c^i 

2 +2 +2 +2 +2 



C. Blocks 

gXIDDDDIXDDDDDrXIITTTII 



W 



+2=12/14=83% 



[^□□□^□□□□□□□□□□□□□□□□□□i 



+2 



=4/7= 57% 



D. Deletions __ 

■■■■■■^□□□□□□□□□□□□□□□□□□□diiiiibDDdi 

■oronnDuuuuuuLuuuuuuuuLunqiiinpnnqi 



2 +2 =4/5= 80% 

Figure 1 Distance methods. A: classically implemented Jaccard 
index. B-D: newly proposed distance methods, w = Domain Walls 
also referred to as walls. Numbers below the spoligotype patterns 
count the number of their common features: either the number of 
common spacers (A), common walls (B), common blocks (C), or 
common deletions (D). These numbers are summed and divided by 
the total number of features in the two spoligotype patterns to 
obtain the similarity between the two spoligotypes. 
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0.6 - 



0 4 



0.2 
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□ 


Correct 


□ 


Ambiguous 


□ 


False 



Figure 2 Assignations matches between SpolDB4 and the 
different distance methods on whole SpolDB4 database (n = 
1937 SIT). References are those described in Table 1. Assignations 
were performed according to the reference for which the distance 
was the lowest. The patterns for which the most similar reference is 
the same as that indicated by its SpolDB4 assignations, were scored 
as "Correct". Note that "Domain Walls" and "Deletions" have equally 
high values of assignations agreeing with the expert classification. 
When the method identified two identically similar references for a 
pattern, this pattern was scored as Unassigned and described as 
Ambiguous assignation. Ambiguity was the lowest with "Domain 
Walls" method. 



the ten families. Spoligotype patterns for which two 
references were equally similar were coined as Unassign- 
able. These assignations were compared to the SpolDB4 
classification. The Jaccard method rarely associated the 
spoligotype patterns to their reference (only 12.3% of 
correctly assigned spoligotypes, see Figure 2). The meth- 
ods that best fitted the expert classification were the 
"Deletions" (84.2% of correctly assigned spoligotype pat- 
terns, n correct = 1402), and the "Domain Walls" method 
(84.0%, n correct = 1399). These methods also provided 
the smallest amount of assignations that differed from 
those of the experts ("Deletions": 11.0%, n fa i se = 183; 
"Domain Walls": 12.3%, n false = 204). These methods 
thus appear to be the best for fitting the expert classifi- 
cation out of the four methods we tested. 

"Deletions" method succeeds in correcting false SpolDB4 
assignations 

Some families' assignations provided by SpolDB4 have 
been debated. For instance, patterns classified as LAM7- 
TUR [32] have been found not to be related to the 
LAM family as strains carrying that pattern do not 
share the ligB 1212 mutation that defines the LAM family 
[34]. Such strains were instead related to T-strains [35] 
and were renamed TUR. All methods tested here except 



the "Deletions" method still assigned them to LAM 
family, including Spotclust. The "Deletions" method 
assigned them all (n LAM7 _ TUR = 8) to the T family as did 
methods using VNTRs [35] or SNPs [34] (Table 2). 
Similarly, all spoligotypes assigned to the H4 subfamily 
(n H 4 = 34) in SpolDB4 were recently excluded from the 
Haarlem family based on them not carrying the mgtC 545 
mutation [34]. They were renamed "Ural" and appropri- 
ately assigned to the T family by the "Deletions" method 
only (Table 2). Hence, part of the assignations suspected 
to be wrong with the "Deletions" method as compared 
to expert classification may in fact correct previous clas- 
sification errors. The "Deletions" method thus recog- 
nizes phylogenetic lineages better than "Domain Walls" 
method and likely at least as well as the expert eye and 
Spotclust. This is further supported by the clear gap 
between the similarity of correctly assigned spoligotype 
patterns to their reference (Figure 3D, black boxes in 
the Deletions plot) and the highest similarity to any 
reference of patterns assigned differently than by the 
expert (light gray boxes) specifically with the "Deletions" 
method. 

Interestingly, Beijing, X and EAI families exhibited no 
incongruence between the "Deletions" and the expert 
method (no light gray box), suggesting that these 
families are clearly and appropriately defined. As 
reported above (Figure 2), the Jaccard method failed to 
assign most spoligotype patterns to any family; for 
instance, no spoligotype patterns could be assigned to 
BEIJ, EAI or X families (Figure 3A) with a maximum 
similarity to any reference not reaching 20% for BEIJ 
family (Additional file 1). "Domain Walls" and "Blocks" 
methods provided either poor resolution between cor- 
rectly and wrongly assigned spoligotype patterns (Figure 
3A and 3C), or a lower number of families with no dis- 
crepancy with the expert classification (only the X 
family with the "Domain Walls" method, Figure 3B). 

Assignations of U spoligotype patterns 

Assignations thus seem phylogenetically relevant using 
the "Deletions" method and the references of the well- 
acknowledged families. We thus propose an alternative 
spoligotype patterns classification on the 1939 spoligo- 
types reported in SpolDB4 (Additional File 2). Assigna- 
tion rate of "U" (Unclassified) patterns was relatively 
low with this method as compared to others (81 out of 
272 U patterns, i.e. 29.8%, Figure 4) but may be more 
reliable as exemplified by three U patterns recently 
assigned [34]: "Deletions" method could only assign one 
of them but without error whereas two of the three 
assignations provided by the "Domain Walls" method 
and Spotclust algorithm were erroneous (Table 2, SIT 
105, 1274 and 1531). 
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Table 2 Assignations of LAM7, H4 and selected "U" spoligotype patterns from SpolDB4, according to different 
methods. 



SIT 


spoligotype pattern 




SpolDB4 


Recent litte 


rature Deletions 
ion family 


Domain Walls 
family 


SpotClust subfamily 


family 


Sub-family 


assignat 


SpolDB3-based 


RIM 


41 




LAM 


LAM7 


T-TUR T 


LAM 


LAM9 


N19 


186 




LAM 


LAM7 


T-TUR T 


LAM 


LAM9 


N19 


367 




LAM 


LAM7 


T-TUR T 


LAM 


LAM9 


A/79 


930 




LAM 


LAM7 


T-TUR T 


LAM 


MM 7 


A/79 


1261 




LAM 


LAM7 


T-TUR T 


LAM 


LAM9 


A/79 


1589 




LAM 


LAM7 


T-TUR T 


LAM 


LAM3 


A/2 


1924 




LAM 


LAM7 


T-TUR T 


LAM 


LAM9 


A/79 


1937 




LAM 


LAM7 


T-TUR T 


LAM 


LAM9 


A/79 


35 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


262 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


361 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


399 




H 


H4 


T-Ura 


T 


H 


12 


N34 


596 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


597 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


656 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


762 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


777 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


817 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


920 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


921 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


922 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1117 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1134 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1165 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1174 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1242 




H 


H4 


T-Ura 


T 


H 


u 


N34 


1269 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1276 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1281 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1292 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1447 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1448 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1457 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1568 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1581 




H 


H4 


T-Ura 


T 


H 


H3 


N34 


1384 




H 


H4 


T-Ura 


T 


U 


13 


N40 


1446 




H 


H4 


T-Ura 


T 


U 


H3 


N34 


1452 




H 


H4 


T-Ura 


T 


U 


H3 


N34 


1455 




H 


H4 


T-Ura 


T 


U 


U 


N34 


1456 




H 


H4 


T-Ura 


T 


U 


H3 


N34 


1461 




H 


H4 


T-Ura 


T 


U 


H3 


N34 


1480 




H 


H4 


T-Ura 


T 


U 


LAM9 


A/79 


105 




U 


U 


H 


U 


Afri 


LAM7 


N29 


1274 




U 


U 


LAM 


U 


Afri 


HI 


A/5 


1531 




u 


u 


X 


X 


X 


XI 


N44 



"Recent literature assignation" represents the standard, and refers to studies using loci other than the CRISPR locus: T-TUR classification has been suggested both 
by Millet et al. [35] and Abadia et al. [34] based respectively on VNTR signature and SNPs signatures. T-Ural classification has been suggested by Kovalev et al. 
[36] as they clustered with H37Rv strains and Abadia et al. [34]. RIM: Randomly Initialized Model. N ... families as defined by Spotclust are described on their web- 
site based on what SpolDB4 families/sub-families are mostly represented: N2 family is described as LAM3-rich; N5 as Hi-rich; N19 as LAM-rich; N29 as LAM+EAI- 
rich; N34 as H3+S-rich; N40 as T3-africanum-rich; N4 as X1-H37Rv-rich. The assignations matching the standard are shown in bold characters. Assignations failures 
are shown in italic. Note that the "Deletions" method provides the highest number of exact assignations and the least assignation failures. 
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Figure 3 Plot of similarity to their reference for patterns assigned as the expert classification (Black) and differently than the expert 
classification (Gray). Black boxes: box plots of the similarity to their reference for patterns with congruent classification between distance-based 
method and expert-defined. Boxes extend from 0.25 to 0.75 quartiles, and whiskers to the most extreme values. Median is highlighted by a thick 
bar. Grey boxes: box plots of the similarity to their reference for patterns with incongruent classification between distance-based method and 
expert-defined. Families for which no spoligotype patterns gave ambiguity show a single (black) box, corresponding to the mean similarity of 
congruent assignations (Beij, X and EAI with "Deletions" method; X for "Domain Walls"). Families for which no spoligotype patterns were found 
similar to the reference show no data (Beij, T, X, EAI with "Jaccard" method). BEIJ = Beijing; afri=M ofriconum; CAS = Central Asia; LAM = Latino- 
American and Mediterranean; H = Haarlem; EAI = East African Indian. 

V J 



Automatized identification of references by Affinity 
Propagation clustering 

The "Deletions" method is highly useful to classify spoli- 
gotype patterns in the described families, but this classi- 
fication highly depends on the identification of 
references. These references are widely acknowledged 




u tn 0) tn 

= c 

£ 5 m £ 

-i CD 

Q 

Figure 4 Assignations of 'U" patterns managed by the 
different methods. Percentage was calculated based on the 272 
"U" patterns found in SpolDB4. 



for major families but the relevance of finer classifica- 
tion is recurrently debated [25,27,35]. Affinity propaga- 
tion is an algorithm that identifies a representative (also 
called exemplar) for each data point in an iterative man- 
ner until the chosen configuration of exemplars mini- 
mizes a suitable cost function that depends on the 
choice of the clusters (see Methods). A parameter set by 
the user (that we denote as 'penalty', p) determines an 
additional cost for every exemplar found. When p is low 
(high negative value), large clusters are built where 
some data points have relatively low similarity with their 
representative. As p increases, the clusters reach smaller 
sizes so that they become 

numerous, and the mean similarity with the represen- 
tative increases. Interestingly, when the number of clus- 
ters does not vary even if the penalty changes, this 
indicates that the data points are not evenly distributed, 
i.e., form meaningful clusters. When applying this 
method to the SpolDB4 dataset, relevant numbers of 
clusters were found to be 14 and 32 (Figure 5). The 
mean similarity with the representative was higher using 
Affinity Propagation as compared to K-Means or with 
Bionumerics applying hierarchical clustering (Additional 
File 3). The clustering in 14 clusters reproduced most of 
the 10 references identified by the experts (references 
for animal strains, CAS, EAI, H, LAM, T, and X, Table 
3). However, H family was divided in HI and H3, none 
of them including the H4. H4 was grouped with T spoli- 
gotype patterns as suggested by previous studies that 
renamed it Ural [34,36]. We propose some renamings 
according to major families represented in each cluster 
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penalty (p) 

Figure 5 Number of clusters found by Affinity Propagation as a function of the penalty p for the distance between a data point and 
its reference. Note that two plateau can be detected, at 14 and 32 clusters respectively, indicating that the corresponding clustering is robust, 
and therefore might be relevant. 

I J 



Table 3 References after Affinity Propagation clustering for n dusters = 12. 

AP-family Reference Majoritary SpolDB4 family 

SpolDB4 subfamily SIT Spoligotype pattern Family Proportion in the AP-family Total Nb 



animall (Bovl -3-Cap-Mic-Pin) 


bovis_1 


482 — 


— Animal 


0.888 


206 


animal2(Bov2) 


bovis_2 


683 — 


— Animal 


0.621 


66 


Beij-afri 


BEIJ 


255 — — — 


— Afri 


0.339 


56 


CAS 


CAS_1 


26 — — 


- CAS 


0.760 


96 


EAI 


EAI_5 


236 


- EAI 


0.84 


250 


H1-2 


H_1 


47 


_ h 


0.853 


68 


H3 


H_3 


50 


_ H 


0.874 


111 


LAM(9_3_i 1 _ 6 ^ ) 


LAM_9 


42 — 


— LAM 


0.721 


179 


T2 


T_2 


52 


T 


0.545 


145 


T3-LAM (2 _5) 


LAM_2 


17 . — _ 


— LAM 


0.432 


148 


T-(Ural-H3-LAM 10 - 7 )) 


T_1 


53 


_ T 


0.823 


351 


S(&U) 


S 


34 — 


_ T 


0.554 


74 


T(&U) 


T_1 


173 — — 


_ T 


0.420 


81 


X 


X_1 


119 


— X 


0.75 


108 



BEIJ = Beijing (also East Asia); afri=M. africanum; CAS = Central Asia (also East Africa and India); LAM = Latino-American and Mediterranean; H = Haarlem; EAI = 
East African Indian. Confirmed families are shown in bold. 



Borile et al. BMC Bioinformatics 201 1, 12:224 
http://www.biomedcentral.eom/1 471 -21 05/1 2/224 



Page 8 of 14 



Table 4 References after Affinity Propagation clustering for n C | USters = 32. 



AP-subfamily 
naming 



reference 



Most represented subfamilies 



Nb of spoligotype 
patterns 



Classical subfamily 
naming 



SIT 



spoligotype (43 
format) 



First most 
represented 
subfamily 

Subfamily Prop. 



Second most repr. 
family 



Afril AFRI1 181 

Afri2-3 AFRI2 331 

Beij BEIJ 1 

Bovl-3 B0V1 482 

Bov2 B0V2 683 

Cap CAP 647 

CAS CAS1 26 

EAI1 EAI1 48 

EAI3-5 (del2-3-37- EAI2 1 1 
38-39) 

EAI2 (del3-20-21) EAI2 19 

EAI EAI5 236 

EAI6 (del23-37) EAI6 292 

H1-2 H1 47 

H3 H3 50 

Ural H4 262 

LAM5-2-1(del3-13) LAM2 17 

LAM3 LAM3 33 

LAM LAM9 42 

Manu MANU2 54 

Pin-Mic PIN 637 

S S 34 

T (T1-H3-Lam10-Cam) T1 53 

Tla (del5-40-43) T1 833 

T1b(del21) T1 291 

Tic (dell 5) T2 118 

T2 (del40) T2 52 

T3 (dell 3) T3 37 

T4 (del 19-23-24-38- T4 39 
39) 

T5 (del23) T5 44 

X XI 119 

X2 X2 137 

SEA1 (del29-34) U 458 



AFRI1 
AFRI2 
BEIJ 
BOV1 
BOV2 
CAP 
CAS1 
EAI1 
EAI5 

EAI2 
EAI5 
EAI6 
H1 
H3 
H4 
LAM5 
LAM3 
LAM9 
MANU2 
BOV 
S 

T1 
T1 
U 
T1 
T2 
T3 
T1 

T5 
X1 

X2 
U 



0.531 
0.364 
0.842 
0.585 
0.467 
0.75 
0.487 
0.804 
0.383 

0.5 
0.651 

0.5 
0.790 
0.927 
0.714 
0.207 
0.455 
0.574 
0.793 
0.391 
0.678 
0.828 
0.484 
0.367 
0.432 
0.521 
0.373 
0.406 

0.561 
0.492 
0.824 
0.955 



AFRI 
AFRI3 

U 
BOV 
BOV 

u 

CAS 

u 

EAI3 

U 
EAI4 
EAI5 

U 

U 

u 
u 
u 

LAM 11 
U 

u 
u 

H3 

U 
T1 

U 

U 

U 

T4 

U 
U 
T1 

CAS 



32 
22 
19 
159 
45 
20 
80 
46 
55 

48 
86 
42 
62 
96 
28 
92 
33 
136 
29 
23 
59 
261 
31 
30 
37 
119 
59 
32 

41 
61 
34 
22 



BEIJ = Beijing (also East Asia); afri=M. africanum; CAS = Central Asia (also East Africa and India); LAM = Latino-American and Mediterranean; H = Haarlem; EAI = 
East African Indian. Confirmed subfamilies are shown in bold. 



(Table 3). When performing clustering with 32 clusters, 
many of the SpolDB4 subfamilies were identified. Some 
of them were however merged such as africanum2-afri- 
canum3, Bovisl-3, pinnipedii-microtii, LAM2-LAM1- 
LAM5, XI and 3 in X, etc. (Table 4). Among EAI, four 
meaningful subfamilies were identified whereas only 2 
were among LAM. This suggests that the LAM family 
was oversplitted in the expert classification. In contrast, 
seven subfamilies were found among the T family. Two 
of them exhibited complex signatures with few 



spoligotype patterns actually matching the whole signa- 
ture (for example, n = 5/29 among Tla). Subfamilies 
T2, T3 and T5 were confirmed by this method. One 
family still had SIT53 (Tl) as a reference indicating that 
many spoligotypes (n = 261) still cannot be further clas- 
sified according to their spoligotype pattern. Last, one 
family was created from U spoligotypes and has SIT 458 
as a reference. Most patterns carried the deletion of 
spacers 29 to 34 that could constitute a new significant 
signature. Countries in which the corresponding SITs 
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Table 5 Spoligotype patterns clustered with SIT 458 with 
Affinity Propagation when n dusters = 32. 
SIT Spoligotype pattern SpolDB4 assignation Main country 



458* 


U 


THA 






GBR 






GNB 


527 


U 


GNB 


863 — — 


— u 


BRA 


1 1 72 °™° 


— u 


EST 


1 186 


— u 


THA 


1 1 87 — — ° 


— u 


THA 


1374 — 


— u 


MYS 


1386 


- — — u 


BGD 


1436 ■ — - 


- — u 


BGD 


1462 — 


— u 


GEO 


1515 


— u 


MDG 


1518 ■ 


— u 


MDG 


1519 _ „ 


— u 


MDG 


1520 = — - 


- — u 


MDG 


1521 — — — 


- — u 


MDG 


1524 — — 


— u 


MDG 


710 _ 


— u 


NLD 


405 — 


— u 


VNM 


426 — — 


CAS_2 


USA 


523 


U 


MYS 



Note that most of the patterns carry the 29-34 spacers deletion, and that 
most of them are unclassified by SpolDB4. "Main country" refers to the 
country where the highest number of isolates carrying this pattern were 
found according to SpolDB4 [33]. * indicates the spoligotype proposed as a 
reference by the Affinity Propagation algorithm. The countries are identified 
via the IS03166-1 alpha-3 code. THA = Thailand; GBR = United Kingdom; GNB 
= Guinea-Bissau; BRA = Brazil; EST = Estonia; MYS = Malaysia; BGD = 
Bangladesh; GEO = Georgia; MDG = Madagascar; NLD = Netherlands; VNM = 
Vietnam; USA = United States. 

were most abundant surround the Indian Ocean: Mada- 
gascar, Thailand, India and Vietnam (Table 5). We thus 
named it SEA1 (South East Asia 1) (Table 4). The sig- 
nificance of this signature compared to the classical EAI 
signature, which differs only by the presence of spacer 
33, remains to be established. 

Discussion 

Here we first validated a simple distance method that 
can be used to classify CRISPR genetic profiles based on 
a worldwide M. tuberculosis spoligotype database. Sec- 
ond, using a recent clustering algorithm exploiting a dif- 
ferent approach with respect to those commonly used in 
the biological sciences community, we could identify an 
alternative M. tuberculosis classification. The compari- 
son between the largely validated expert classification 
described in the international database SpolDB4 and our 
alternative classification validates our approach for M. 
tuberculosis CRISPR profiles, opening the way for its use 
for other bacterial species where CRISPR were shown to 
provide interesting typing information [16]. 



Clustering power of CRISPR patterns 

M. tuberculosis complex (MTC) has been infecting 
humans for at least 2600 years [37] and could be 20,000 
years old or even much older [6,38]. Despite its 
restricted genetic diversity even between human and 
animal strains [39,40], phylogenetical relationships have 
been detected using polymorphic DNA sequences 
[41,42]. CRISPR loci characterized using the "spoligotyp- 
ing" technique have been used to define families 
through the use of so-called signatures i.e. the absence 
and presence of characterized units of CRIPSR loci, the 
spacers [43]. Most of these families found independent 
support such as host range or congruence with indepen- 
dent genetic markers [22,23,44], even SNPs and Regions 
of Deletion [26,34,45]. However, some of them were "ill- 
defined" i.e. had a signature that was shared by several 
other groups, and others were defeated by independent 
loci: H4 subfamily was renamed Ural as it was related to 
T strains and not H strains [34,36], LAM7 and LAM 10 
were renamed TUR and Cameroon respectively as they 
are unrelated to LAM strains [6,34,35]. As a conse- 
quence, the use of CRISPR patterns to infer phylogeneti- 
cal relationship was recurrently discussed [44,46]. 

We used here an automatized approach for clustering 
CRISPR patterns. Our clusters largely reproduced the 
well-acknowledged MTC families and provided mean- 
ingful clustering for Ural, TUR and Cameroon. In fact, 
the misclassification of Ural among Haarlem family was 
due to the merging of all signatures having spacer 31 
deleted and spacer 32 present disregarding the left bor- 
der of the deletion. This classification criterion is not 
relevant knowing the evolutionary dynamics of CRISPR 
loci due either to the insertion of IS6110 elements or to 
the deletion of one of several adjacent spacers. This 
kind of errors is avoided if comparison is performed 
using an algorithm identifying complete signatures (left 
and right borders of the deletions) as included in our 
automatized approach (see below for a detailed discus- 
sion on methods used to calculate distances between 
strains). 

Still, the fact that some families are "ill-defined" is an 
intrinsic problem of spoligotyping: CRISPR loci in M. 
tuberculosis are relatively short and they evolve at a rate 
that cannot exclude the absence or the insufficient num- 
ber of mutation in some lineages. This intrinsically lim- 
its the power of our study, i.e. we cannot classify all 
strains, and not all of them with the same precision. 
However, this problem does not affect the assignation 
quality of the strains we classify which are in fact 
numerous (more than 80%). 

We thus argue that CRISPR profiles evolving by the 
insertion of transposable elements or by deletion such 
as those of M. tuberculosis contain relevant information 
for clustering and even inference of some phylogenetic 
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links. The targeted locus must however not be missing 
for the individuals to be classified. To avoid this pitfall, 
the use of CRISPR loci should restrict to recently 
diverged groups as is the M. tuberculosis species com- 
plex (more than 99.9% identity). Such organisms 
uncover diverse human pathogens such as Yersinia pes- 
tis, Salmonella enterica, Bacilllus anthracis, Mycobacter- 
ium leprae and Mycobacterium ulcerans. Still, the use of 
CRISPR profiles in phylogenetic reconstructions would 
benefit from further developments and validations for 
species with still active CRISPR loci. 

Distance methods for CRISPR profiles 

If CRISPR can be used to infer phylogenetic relation- 
ship, the evolutionary model or distance method used 
during the inference is also of great importance. Several 
developments had been proposed until now. We want 
to discuss here what our approach adds to previous 
ones. 

CRISPR profiles (spoligotype patterns) form a 
sequence of binary data. As such, it has been analyzed 
with tools developed for binary information such as the 
Jaccard Index that focuses on the sharing of every unit 
in the profile (here the spacers) taken independently. 
This however ignores an essential feature of the corre- 
sponding CRISPR locus: that it evolves by the loss of 
spacers. These losses can occur either because of the 
insertion of a transposable element that disrupts the 
sequence used in the spoligotyping technique, or by 
deletion. Deletions can occur for several spacers at once, 
even if the frequency of large deletions may be lower 
[21]. As a consequence, the distance between two pat- 
terns, one carrying many spacers and the other carrying 
one large deletion, should not be considered as propor- 
tional to the number of spacers that were lost (as done 
by the Jaccard index), but as corresponding to a single 
mutation event. The methods proposed by the Bennett 
laboratory [30,31] take into account the deletion process 
and add a probability function that best mimics the fre- 
quency of deletion size. In Spotclust, a Bayesian algo- 
rithm incorporating the inference of ancestral 
spoligotype patterns based on SpolDB3 database is used 
to assign spoligotypes to SpolDB3 subfamilies or to 
families built using a Randomly Initialized Model (RIM) 
[30]. We showed here that, by simply using expert- 
defined references of main families and the "Deletions" 
distance method that is based on deletion signatures, we 
could better assign Unknown spoligotype patterns than 
Spotclust as well as correct previous erroneous assigna- 
tions in SpolDB4 classification such as those to LAM7 
(TUR) [29]. For Spotclust algorithm, this was true for 
both the SpolDB3-based classification and the Randomly 
Initialized Model. The reason for that could be either 
that the size of the database used by Spotclust was too 



small to capture evolutionary steps relevant to MTC 
evolution, or that Bayesian statistical inferences are too 
dependent on priors. 

Performance of the Affinity propagation algorithm on 
CRISPR profiles clustering 

Affinity Propagation is a message-passing algorithm that 
considers clustering as a problem of minimizing an 
"energy" function of the clusters configuration in the 
data set (see Methods section for a general review of the 
algorithm, and [8]). This approach seems particularly 
promising and could help solving species delineations in 
asexual lineages where obligate gene exchange cannot 
be used as a delineation criterion [9]. One of the main 
features of the algorithm is the possibility of regulating 
the total number of clusters as a function of an input 
parameter of the algorithm (called the "chemical poten- 
tial" u., by analogy to the chemical potential of physical 
systems, or p for penalty parameter, see Methods). Also 
the high speed (the computational time goes as N 2 if N 
is the size of the dataset) and thus the possibility to ana- 
lyze very large networks is encouraging the use of this 
algorithm. With this method we identified both families 
and subfamilies in MTC. A single family out of 14 made 
no sense (Beijing-africanum). This is likely due to a lack 
of information in Beijing spoligotype pattern as the large 
1-36 deletion limits the recognition of other signatures. 
When considering patterns carrying a larger number of 
spacers, the classification was largely congruent with the 
literature. In addition, we could identify new signatures, 
especially one, termed SEA1, among previously unclassi- 
fied patterns. We therefore believe that this algorithm is 
very useful for classifying the widely used 43 -spoligotype 
patterns in M. tuberculosis but could prove even more 
useful on patterns larger than 43 spacers, e.g. the 
improved 68 spacers format. 

"Euro-american" lineage evolution 

Despite large sequencing efforts [25,47], there has been 
a standing difficulty in unraveling the relationships 
inside "Euro-American" lineage strains (carrying the 33- 
36 deletion in the spoligotype pattern), especially in the 
so-called "T family" described in SpolDB4 [23]. Here, 
using SpolDB4 database, we could challenge expert- 
defined families and subfamilies. We first confirmed the 
validity of S and T2 subfamilies that we suggest to con- 
sider as families. The S family was first described in 
Sicily [48] and independently identified in Quebec 
where a sublineage was shown to harbor a peculiar 
pncA SNP [49]. The T2 family, defined by the absence 
of spacer 40 was originally described as M. africanum 2, 
however was shown later to be a bona fide M. tubercu- 
losis subfamily [50]. We also confirmed the reliability of 
Haarlem family subclustering, if renaming H4 as Ural, 
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and suggest considering Hl-2 and H3 as two families. 
We confirmed the validity of T3 and T5 families as well 
as T4-CEU (although T4 alone was invalidated). Some 
LAM subfamilies renamings based on VNTR and SNP 
loci [34-36] are given further support (LAM7 as TUR, 
LAM10 as Cameroon), while other were merged 
(LAM1, 2 and 5). The tendency to merge many expert- 
defined families was not pervasive. Indeed in the EAI 
family, four subfamilies out of the 5 expert-defined ones 
were confirmed. 

Combining the families and subfamilies identification, 
we could provide a simplified evolutionary scheme for 
this lineage (Figure 6). We hope in the future, by apply- 
ing affinity propagation on 68 spoligotype patterns, to 
identify other Euro-American subclusters. 



Hl-2 




Xl-3 



X2 



LAM2-1-5 



T4-CEU 



Other Ts 

Figure 6 Evolutionary scheme of the Euro-American supported 
sublineages. Note that our study does not identify the monophyly 
of H1-H2 and H3. Monophyly of T sublineages is not supported by 
this method either. LAM monophyly (once LAM7 and LAM10 were 
extracted) is in contrast well supported. 



Conclusion 

This study describes 1) a novel distance method to be 
applied on genetic loci evolving by deletion, as for 
instance do inactive CRISPR loci, 2) a framework to 
take advantage of identified references for classifying 
individuals using such loci, 3) a way to identify new 
references using the Affinity Propagation algorithm [8], 
and 4) assignations and assignation tools for M. tubercu- 
losis complex. The distance method and the framework 
to identify known references were largely validated by 
worldwide M. tuberculosis database at the CRISPR locus 
(spoligotype patterns). This work encourages the use of 
CRISPR patterns to cluster strains in other organisms 
carrying such loci and for which wide genotyping has 
been undertaken as it is now the case for human patho- 
gens such as Yersinia pestis, Salmonella enterica, Bacill- 
lus anthracis, Mycobacterium leprae and M. ulcerans. 
Affinity propagation could also be useful to cluster 
other genotyping data such as SNPs or minisatellites. 
Databases larger than those available by now are how- 
ever required to test the validity of this method on such 
markers. 

Methods 

Spoligotyping data 

SpolDB4[33], containing 1939 shared international spoli- 
gotypes (SIT) was used as raw data for spoligotyping 
diversity analysis. This database contains family or sub- 
family information, with some uncertainties indicated 
such as LAM3 and S-convergent or T1-T2. To simplify 
the analyses, when two assignations were provided, only 
the first one was kept. We also merged certain families 
when the number of spoligotype patterns was very small 
and not been confirmed by SNP typing [40]. Specifically, 
the families we retained are: africanum (n = 46), animal 
strains (grouping BOVIS, PINNIPEDII, MICROTII, 
CAPRAE, n tot = 231), Beijing (n = 21), CAS (n = 86), 
EAI (n = 213), H (n = 233), LAM (n = 224), MANU (n 
= 39), T (n = 482) including S and H37Rv ST as sug- 
gested by Brudey et al (2006), X (n = 90). We excluded 
SIT69 which was suppressed by Institut Pasteur [33] as 
well as the canetti spoligotype pattern which is unique 
(SIT592). There are 272 unclassified spoligotypes (U). 
The references for each family correspond to SpolDB4 
description [23]; they are listed in Table 1. 

Methods to compute distances 

Three new methods to compute distances were designed 
that fit CRISPR loci evolutionary dynamics such as that 
of M. tuberculosis complex i.e. evolution by deletion or 
transposon insertion. All methods rely on the identifica- 
tion of the beginning and the end of spacers deletions. 
These limits were named Domain Walls (Figure 1). The 
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"Domain Walls" method measures the proportion of 
common Domain Walls between two CRISPR profiles, L 
e. if the profiles have i w and j w Domain Walls respec- 
tively and K w are common, the distance is: 



d-D Walls = 



2K W 



The "Blocks" method considers blocks of spacers; let 
i b and j b be the numbers of blocks carried by the two 
profiles and K b the number of blocks they share, 



^Blocks 



2Kb_ 
k + jb 



The "Deletions" method considers deleted blocks; let 
i d and j d be the numbers of blocks of deletions carried 
by the two profiles and K d the number of shared dele- 
tions, 



^Deletions 



2K d 
id + id 



These distance methods were used to compute the 
distance between each SpolDB4 spoligotype pattern and 
the references of the ten main M. tuberculosis complex 
families. The scripts for such calculations were written 
in R language [51] and are available upon request. Each 
pattern was assigned to the family whose reference it 
was the most similar to. 

Clustering algorithms 

Affinity Propagation (AP), proposed first in [8], is a 
recent clustering method based on the choice of "exem- 
plars" as centers of the clusters, i.e., one representative 
data point for each cluster to which the other nodes 
rely. The choice of the exemplars is based on the mini- 
mization of the total "energy" of the system, function of 
the total distance between data points and exemplars in 
a given clusters configuration. This method falls in the 
class of message-passing type algorithms, exploiting the 
Belief Propagation method (also known as Cavity 
method in the physics literature) to minimize the energy 
function in an computationally efficient way (from the 
exponential time complexity of the naive methods to O 
(N 2 ), where N is the total number of nodes to cluster). 
The starting point is thus a set of data points, represent- 
ing the nodes of the network, and a similarity matrix S 
defining the similarities among all the nodes as deduced 
from the distance between all these nodes. The similar- 
ity between two points i and ; is defined as 

j) = 1 - d{i,j) 

provided that the distance d ranges from 0 to 1, as in 
our case (this is always true up to a normalization of 



the distance). The aim is then to find a map c N] 
-^{1,..., A/}, with N being the total number of data points 
and c(i) = c t is the exemplar of node i, such that the 
vector c = (c\, ...,Cm) minimizes the energy function 

N N 

E(3 = -£SfrcO- »£logOc<(c)) 

1=1 ^ 1=1 

The first term of the function defined above is (minus) 
the sum of all the similarities between a point and its 
exemplar, while the second term is introduced to avoid 
any configuration in which an exemplar does not belong 
to the cluster that itself represents, that is, an exemplar 
must be the exemplar of itself. This is granted by defin- 
ing the function Xi(s) as 



0 a i i n 3 j : Cj = i 

1 otherwise 



and by taking the log function of it and summing it 
over all the nodes, so that the energy becomes infinite if 
at least an exemplar is represented by a different exem- 
plar. The parameter /3 plays formally the role of the 
inverse of the temperature in a thermodynamical sys- 
tem, and thus determines the level of thermal noise act- 
ing on the system. This means that varying this 
parameter, but keeping it finite, allows the algorithm to 
accept configurations of the clusters that do not exactly 
corresponds to minima of the energy function. We con- 
sider here only the optimal case of zero temperature, L 
e., /3 — > oo (for a general and exhaustive treatment of the 
cavity method see, for example [52]). 

Once the Cavity equations are written one is left with 
two coupled update rules for each couple of nodes (i, j): 



r t+i 



a t+1 - 



S(i,j) — max (flj^,- + S(k, z)) 
■- min 0, r*j-+j + max (°' r Uj) 



These update rules represents messages that the nodes 
are exchanging between iteration t and t+1, with r\^- 
and flj^j representing respectively the energetic "com- 
petition" between node i and all the other nodes except 
/ to be the exemplar of node / and the gain in the total 
energy of the system if node j is represented by node L 
The notation i j indicates that the message is sent 
from node i toward node When the update equations 
converge, then the value of each c b i = 1,..., N, is 
obtained summing over all the messages arriving at 
node i and maximizing the sum. The diagonal elements 
of the similarity matrix, that are not constrained to be 
equal to the unity, play the role of an effective cost to 
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every chosen exemplar, and thus a cost for the total 
number of clusters found. They are a fine-tuning for the 
selection of the total number of clusters found by AP. 
In our study we chose to consider every data point a 
priori equally probable to be an exemplar, so we set S(i, 
i) = p V i = 1,..., N. Varying the parameter p from very 
large (negative) values up to positive values gives the 
range of total clusters from 1 to N, and we interpret a 
stability in the total number of clusters under changes 
of this parameter as a genetically meaningful grouping 
of the data, as discussed in the Results section. The 
similarity matrix S was obtained using the "Deletions" 
distance that had turned out to be the most accurate 
distance. Linear combinations of the various distances 
introduced in the previous section were also considered, 
but the overall result still favors the Deletions distance. 
We performed also a comparison of AP with other 
"classical" clustering algorithms, such as K-Means and 
Hierarchical clustering. We considered the performance 
with respect to the experts' classification as defined in 
SpolDB4 [23,33] and identified that AP found clusters 
with much lower error (see Additional File 3). The 
script for computing the distance matrices of SpolDB4 
database and performing the analysis with AP was writ- 
ten in Matlab as a self-contained script, the bare AP 
algorithm for Matlab is available from the authors Frey 
and Dueck. 

Additional material 
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